Principal Component Analysis


PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is performed either using the EVD of the covariance matrix or the SVD of the mean-centered data matrix.

Prerequisites

The reader should be familiar with linear algebra and statistics concepts.

Competences

The reader should be able to perform PCA on a given data set.

References

For more details see

Definitions

A data matrix is a matrix $X\in\mathbb{R}^{m\times n}$, where each column corresponds to a feature (say, a certain gene), and each row corresponds to an observation (say, an individual).

The mean of a vector $x\in\mathbb{R}^{n}$ is $\mu(x)=\displaystyle \frac{x_1+x_2+\cdots + x_n}{n}$.

The standard deviation of a vector $x$ is $\sigma(x)=\displaystyle \sqrt{\frac{\sum_{i=1}^n (x_i-\mu(x))^2}{n-1}}$. The variance of a vector $x$ is $var(x)=\sigma^2(x)$.

The vector of means of a data matrix $X$ is the row-vector of means of the columns of $X$, $\mu(X)=\begin{bmatrix}\mu(X_{:,1}) & \mu(X_{:,2}) & \ldots & \mu(X_{:,n})\end{bmatrix}$.

The zero-mean centered data matrix is the matrix $\bar X$ obtained from a data matrix $X$ by subtracting from each column its mean, $$ \bar X= \begin{bmatrix} X_{:,1}-\mu(X_{:,1}) & X_{:,2}-\mu(X_{:,2}) & \cdots & X_{:,n}-\mu(X_{:,n}) \end{bmatrix}\equiv X-\mathbf{1}\mu(X), $$ where $\mathbf{1}=[1,1,\ldots,1]^T$.

The covariance matrix of a data matrix $X\in\mathbb{R}^{m\times n}$ is the matrix $$ \mathop{\mathrm{cov}}(X)=\displaystyle \frac{1}{m-1}[X-\mathbf{1}\mu(X)]^T[X-\mathbf{1}\mu(X)] \equiv \frac{\bar X^T \bar X}{m-1}, $$ where $m$ is the number of observations (rows).
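
Here is a minimal sketch, assuming a small random data matrix (the variable names μX, Xc and C are illustrative only), which checks these definitions against the Statistics standard library:

using LinearAlgebra, Statistics, Random
Random.seed!(123)
m,n=6,3
X=rand(m,n)              # rows are observations, columns are features
μX=mean(X,dims=1)        # vector of means (1×n row-vector)
Xc=X.-μX                 # zero-mean centered data matrix X̄
C=Xc'*Xc/(m-1)           # covariance matrix from the definition
C≈cov(X)                 # should return true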

Facts

Given a data matrix $X$, let $\mathop{\mathrm{cov}}(X)=U\Lambda U^T$ be the EVD with non-increasingly ordered eigenvalues, $\lambda_1\geq \lambda_2\geq \cdots \geq \lambda_n$.

  1. $\mathop{\mathrm{cov}}(X)$ is a symmetric positive semidefinite (PSD) matrix.

  2. $\mathop{\mathrm{cov}}(X)=\mathop{\mathrm{cov}}(\bar X)$.

  3. Let $T=\bar X U$. The columns of $T$ are the principal components of $\bar X$. In particular:

    1. The first principal component of $\bar X$ (or $X$) is the first column, $T_{:,1}$. It is a projection of the zero-mean centered data set $\bar X$ on the line defined by $U_{:,1}$. This is the direction along which the data have the largest variance.
    2. The second column (the second principal component), $T_{:,2}$, is the projection of $\bar X$ on the line defined by $U_{:,2}$, which is orthogonal to the first one. This is the direction with the largest variance after subtracting the first principal component from $\bar X$.
    3. The $k$-th principal component is the direction with the largest variance after subtracting the first $k-1$ principal components from $\bar X$, that is, the first principal component of the matrix $$ \hat X=\bar X-\sum_{i=1}^{k-1} \bar X U_{:,i} U_{:,i}^T. $$
  4. Let $\bar X=\bar U \Sigma V^T$ be the SVD of $\bar X$. Then $V=U$ (up to the signs of the columns) and, with that choice of signs, $T=\bar U\Sigma V^T V\equiv \bar U \Sigma$.

  5. Reconstruction of the data from the principal components is as follows:

    1. Full reconstruction is $X=T U^T +\mathbf{1} \mu(X)$.
    2. Reconstruction from the first $k$ principal components is $\tilde X =T_{:,1:k} U_{:,1:k}^T +\mathbf{1} \mu(X)$ (a numerical check of Facts 2, 4 and 5 is sketched after this list).
  6. Partial reconstructions can help in obtaining various insights into the data. For example, the rows of the matrix $T_{:,1:k}$ can be clustered by the $k$-means algorithm, and the points defined by the first three columns of $T$ can be plotted to visualize the projections of the clusters. Afterwards, the computed clusters can be mapped back to the original data.

  7. A heuristic guess for the number of important clusters is given by the location of the "knee" in the plot of the singular values of $\bar X$.
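
The following is a minimal numerical check of Facts 2, 4 and 5, a sketch only, using a small random data matrix and illustrative variable names:

using LinearAlgebra, Statistics, Random
Random.seed!(123)
m,n=8,3
X=rand(m,n)                        # rows are observations
μX=mean(X,dims=1)
Xc=X.-μX                           # centered data X̄
λ,U=eigen(Symmetric(cov(X)))       # eigen() returns eigenvalues in ascending order
λ=reverse(λ); U=U[:,end:-1:1]      # reorder non-increasingly, as in the Facts
T=Xc*U                             # principal components (Fact 3)
Ū,σ,V=svd(Xc)
println(cov(Xc)≈cov(X))            # Fact 2
println(λ≈σ.^2/(m-1))              # eigenvalues of cov(X) are σᵢ²/(m-1)
println(abs.(U'*V)≈Matrix(I,n,n))  # Fact 4: V equals U up to column signs
println(abs.(T)≈abs.(Ū*Diagonal(σ)))  # Fact 4: T = ŪΣ up to column signs
println(X≈T*U' .+ μX)              # Fact 5: full reconstruction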

Example - Elliptical data set

We generate a quasi-elliptical set of points and compute its principal components.


In [2]:
# Generate data points
using Random
Random.seed!(456)
n=3
m=500
ax=[8,3,1]
X=Array{Float64}(undef,m,n)
for i=1:n
    X[:,i]=rand!(X[:,i])*ax[i]
end
# Parameters
u=(rand(m).-0.5)*π
v=(rand(m).-0.5)*2*π
for i=1:m
    X[i,1]=X[i,1]*cos(u[i])*cos(v[i])
    X[i,2]=X[i,2]*cos(u[i])*sin(v[i])
    X[i,3]=X[i,3]*sin(u[i])
end

In [3]:
X0=copy(X)


Out[3]:
500×3 Array{Float64,2}:
  0.210576    -0.127732    0.776479   
  0.591569    -0.876754    0.262941   
  3.71806     -0.0563136  -0.024527   
 -3.22498     -0.652015   -0.0999344  
 -0.00225668   0.184277   -0.343653   
  0.592944     2.66002     0.0073981  
 -0.0503422   -0.0952142  -0.00904418 
  0.370151    -0.153789   -0.264933   
 -3.27693      0.339955    0.404337   
 -2.30723      0.751473    0.399126   
  2.48328     -0.579145   -0.10629    
  0.460629    -0.0970859   0.354729   
  3.38002      0.760876   -0.111974   
  ⋮                                   
 -5.24438     -0.121456    0.0163747  
 -3.93267      0.63387    -0.0962625  
 -1.26243      0.0279709   0.248269   
 -0.0406195   -0.558751   -0.230258   
  2.57882      0.0022758   0.141931   
  4.36542      0.94466    -0.420983   
 -0.55316      0.142784    0.00606709 
 -2.59395      1.04655    -0.724932   
  0.157457    -0.60424     0.664749   
 -4.03607     -0.702379   -0.127965   
 -0.682568     1.11439    -0.181113   
 -1.60667     -1.75284     0.000999079

In [4]:
using PyPlot

In [5]:
plot3D(X0[:,1],X0[:,2],X0[:,3],"+")


Out[5]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x0000000025F3DDD8>

In [6]:
# Compute the means. How good is the RNG?
using Statistics
μ0=mean(X,dims=1)


Out[6]:
1×3 Array{Float64,2}:
 -0.0920898  0.0422521  -0.0313562

In [7]:
# Subtract the means
using LinearAlgebra
X=X.-μ0
# Rotate by a random orthogonal matrix
Q,r=qr(rand(3,3))
X=X*Q;

In [8]:
# Translate
S=[3,-2,4]


Out[8]:
3-element Array{Int64,1}:
  3
 -2
  4

In [9]:
X.+=S'
# plot3D(X0[:,1],X0[:,2],X0[:,3],"b+")
plot3D(X[:,1],X[:,2],X[:,3],"rx" )


Out[9]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x0000000022D9A5F8>

In [10]:
C=cov(X)


Out[10]:
3×3 Array{Float64,2}:
 0.619018    0.0889886   0.243518
 0.0889886   3.36625    -2.70555 
 0.243518   -2.70555     2.65804 

In [11]:
μ=mean(X,dims=1)


Out[11]:
1×3 Array{Float64,2}:
 3.0  -2.0  4.0

In [12]:
# Fact 2
cov(X.-μ), (X.-μ)'*(X.-μ)/(m-1)


Out[12]:
([0.619018 0.0889886 0.243518; 0.0889886 3.36625 -2.70555; 0.243518 -2.70555 2.65804], [0.619018 0.0889886 0.243518; 0.0889886 3.36625 -2.70555; 0.243518 -2.70555 2.65804])

In [13]:
# Principal components: eigen() returns eigenvalues in ascending order, so the largest is last
λ,U=eigen(C)


Out[13]:
Eigen{Float64,Float64,Array{Float64,2},Array{Float64,1}}
eigenvalues:
3-element Array{Float64,1}:
 0.15669814787095282
 0.7441175461904636 
 5.742492308693448  
eigenvectors:
3×3 Array{Float64,2}:
  0.464569  0.885347   0.0183332
 -0.578243  0.318972  -0.750928 
 -0.670679  0.338257   0.66013  

In [14]:
# Largest principal component
T1=(X.-μ)*U[:,3]
subplot(221)
plot(T1,zero(T1),"rx")
subplot(222)
plot(T1,"b+")


Out[14]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000022E45908>

In [15]:
# Two largest principal components
T2=(X.-μ)*U[:,[3,2]]
axis("equal")
plot(T2[:,1],T2[:,2],"r+")


Out[15]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000026614860>

In [16]:
# All  three principal components
T=(X.-μ)*U[:,[3,2,1]]
plot3D(T[:,1],T[:,2],T[:,3],"rx" )


Out[16]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BE7FBA8>

In [17]:
# Fact 5 - Recovery of the largest component
Y1=T1*U[:,3]'.+μ
plot3D(Y1[:,1],Y1[:,2],Y1[:,3],"rx" )


Out[17]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BEC5588>

In [18]:
# Recovery of the two largest components
Y2=T2*U[:,[3,2]]'.+μ
plot3D(Y2[:,1],Y2[:,2],Y2[:,3],"rx" )


Out[18]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BF43320>

In [19]:
# Recovery of all three components (exact)
Y3=T*U[:,[3,2,1]]'.+μ
plot3D(Y3[:,1],Y3[:,2],Y3[:,3],"rx" )
plot3D(X[:,1],X[:,2],X[:,3],"b+" )


Out[19]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BFB7E10>

In [20]:
# Fact 4 - PCA using SVD
function myPCA(X::Array{T}, k::Int) where T
    μ=mean(X,dims=1)
    U,σ,V=svd(X.-μ)
    U[:,1:k]*Diagonal(σ[1:k])
end


Out[20]:
myPCA (generic function with 1 method)

In [21]:
T1s=myPCA(X,1)
[T1s T1]


Out[21]:
500×2 Array{Float64,2}:
 -0.299962   -0.299962 
 -0.677274   -0.677274 
 -3.80944    -3.80944  
  3.13705     3.13705  
 -0.0913534  -0.0913534
 -0.701393   -0.701393 
 -0.0408378  -0.0408378
 -0.46147    -0.46147  
  3.18378     3.18378  
  2.21151     2.21151  
 -2.57156    -2.57156  
 -0.551052   -0.551052 
 -3.47671    -3.47671  
  ⋮                    
  5.1533      5.1533   
  3.83665     3.83665  
  1.17097     1.17097  
 -0.048093   -0.048093 
 -2.67025    -2.67025  
 -4.46387    -4.46387  
  0.460503    0.460503 
  2.4941      2.4941   
 -0.244074   -0.244074 
  3.94838     3.94838  
  0.583425    0.583425 
  1.52589     1.52589  

In [22]:
# The two largest components using SVD
T2s=myPCA(X,2)
[T2s T2]


Out[22]:
500×4 Array{Float64,2}:
 -0.299962   -0.160003    -0.299962    0.160003  
 -0.677274   -0.918847    -0.677274    0.918847  
 -3.80944    -0.122284    -3.80944     0.122284  
  3.13705    -0.67559      3.13705     0.67559   
 -0.0913534   0.136863    -0.0913534  -0.136863  
 -0.701393    2.61372     -0.701393   -2.61372   
 -0.0408378  -0.137383    -0.0408378   0.137383  
 -0.46147    -0.202334    -0.46147     0.202334  
  3.18378     0.323977     3.18378    -0.323977  
  2.21151     0.729302     2.21151    -0.729302  
 -2.57156    -0.638525    -2.57156     0.638525  
 -0.551052   -0.137113    -0.551052    0.137113  
 -3.47671     0.695634    -3.47671    -0.695634  
  ⋮                                              
  5.1533     -0.130762     5.1533      0.130762  
  3.83665     0.614611     3.83665    -0.614611  
  1.17097    -0.00285697   1.17097     0.00285697
 -0.048093   -0.604167    -0.048093    0.604167  
 -2.67025    -0.0541343   -2.67025     0.0541343 
 -4.46387     0.868697    -4.46387    -0.868697  
  0.460503    0.103952     0.460503   -0.103952  
  2.4941      1.00964      2.4941     -1.00964   
 -0.244074   -0.637758    -0.244074    0.637758  
  3.94838    -0.721286     3.94838     0.721286  
  0.583425    1.0735       0.583425   -1.0735    
  1.52589    -1.78491      1.52589     1.78491   

Example - Real data

We will cluster three data sets from the workshop "Principal Manifolds-2006":

Data set    # of genes (m)    # of samples (n)
D1          17816             286
D2          3036              40
D3          10383             103

In [23]:
# Data set D1
using DelimitedFiles
f=readdlm("files/d1.txt")


Out[23]:
17817×287 Array{Any,2}:
 "CHIP"                        "GSM36777"  …    "GSM37061"    "GSM37062"
 "1007_s_at"                  0.12             0            -0.43       
 "1053_at"                    0.14             0.46         -0.2        
 "117_at"                    -0.14            -0.8          -0.29       
 "121_at"                    -0.11             0.21          0.24       
 "1255_g_at"                 -0.73         …   0.52          0.72       
 "1294_at"                   -0.26            -0.13         -1.06       
 "1316_at"                   -0.05            -0.55         -0.05       
 "1405_i_at"                 -0.85            -2.92         -0.73       
 "1431_at"                   -0.67             0.67          0.58       
 "1438_at"                   -0.49         …   0.87         -1.63       
 "1487_at"                   -0.47            -0.11          0.32       
 "1494_f_at"                  3.97            -0.61         -0.17       
 ⋮                                         ⋱   ⋮                        
 "AFFX-r2-Ec-bioC-3_at"      -0.59         …   0.03         -0.27       
 "AFFX-r2-Ec-bioC-5_at"      -0.75            -0.01         -0.44       
 "AFFX-r2-Ec-bioD-3_at"      -0.59             0.01         -0.43       
 "AFFX-r2-Ec-bioD-5_at"      -0.7              0.14         -0.39       
 "AFFX-r2-Hs18SrRNA-3_s_at"   0.23            -1.98          0.23       
 "AFFX-r2-Hs18SrRNA-5_at"     1.2          …  -1.8           0.2        
 "AFFX-r2-Hs18SrRNA-M_x_at"   0.57            -0.86          0.61       
 "AFFX-r2-Hs28SrRNA-3_at"     1.01            -0.94          0.21       
 "AFFX-r2-Hs28SrRNA-5_at"     0.47            -0.85          0.54       
 "AFFX-r2-Hs28SrRNA-M_at"     0.63            -1.18          0.59       
 "AFFX-r2-P1-cre-3_at"       -0.46         …   0.09         -0.26       
 "AFFX-r2-P1-cre-5_at"       -0.52             0.06         -0.28       

In [24]:
sizeof(f)


Out[24]:
40907832

In [25]:
X=map(Float64,f[2:end,2:end])


Out[25]:
17816×286 Array{Float64,2}:
  0.12   0.88   0.58   0.19   0.27  …  -0.2   -0.35  -0.24   0.0   -0.43
  0.14  -0.88  -0.22   0.94   1.01      0.36   0.02  -0.37   0.46  -0.2 
 -0.14  -0.31   0.2    0.41   0.48      0.56  -0.05   0.1   -0.8   -0.29
 -0.11   0.94   0.15   0.02  -0.06      0.05  -0.02   0.12   0.21   0.24
 -0.73   1.46  -0.32   1.03   0.39     -0.17  -0.23   1.47   0.52   0.72
 -0.26   0.28  -0.24  -0.21  -0.01  …  -0.18   0.38   0.01  -0.13  -1.06
 -0.05   0.6    0.0    0.04   0.02     -0.41  -0.6    0.0   -0.55  -0.05
 -0.85  -0.31  -2.45   0.69   2.71      0.05   0.75  -0.33  -2.92  -0.73
 -0.67   0.63   0.11   0.19   0.28     -0.83  -0.01  -0.1    0.67   0.58
 -0.49  -0.98  -0.92   2.08   1.26      0.16  -0.23  -0.19   0.87  -1.63
 -0.47   0.08   0.13   0.09   0.67  …   0.0    0.14   0.5   -0.11   0.32
  3.97   0.52   4.93  -0.25   0.16     -0.54  -0.45   0.16  -0.61  -0.17
  0.11   0.37  -0.38  -0.23  -0.72      0.43   0.41  -0.49  -0.28  -0.3 
  ⋮                                 ⋱                               ⋮   
 -0.59   0.36   0.24   0.54   0.22      0.29   0.04   0.44   0.03  -0.27
 -0.75  -0.02   0.32   0.42   0.23  …   0.22  -0.09   0.62  -0.01  -0.44
 -0.59   0.25   0.4   -0.06  -0.14      0.36   0.13   0.74   0.01  -0.43
 -0.7    0.06   0.46  -0.07   0.03      0.39   0.1    0.64   0.14  -0.39
  0.23  -0.01   1.81   0.64  -1.27     -0.32  -0.57   0.1   -1.98   0.23
  1.2   -0.02   2.71   0.48  -4.13     -0.39  -1.04  -0.56  -1.8    0.2 
  0.57   0.72   2.45   0.61  -2.12  …  -0.72  -0.21  -0.13  -0.86   0.61
  1.01  -0.4    2.13   0.98  -1.13     -0.18  -0.62   0.22  -0.94   0.21
  0.47   0.19   1.86   0.8   -0.7      -0.37   0.27  -0.07  -0.85   0.54
  0.63   0.75   1.48   0.91  -1.16      0.06  -0.37   0.32  -1.18   0.59
 -0.46   0.25   0.42   0.45   0.33      0.34   0.13   0.6    0.09  -0.26
 -0.52   0.15   0.47   0.35   0.15  …   0.23   0.09   0.67   0.06  -0.28

In [26]:
# Clear a big variable
f=[]


Out[26]:
0-element Array{Any,1}

In [27]:
# Fact 7 - Plot σ's and observe the knee
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)


Out[27]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000001CAB2B0>

In [28]:
# PCA on X, keep 20 singular values
k=20
T=myPCA(X,k)


Out[28]:
17816×20 Array{Float64,2}:
 -2.60997     1.29427    3.82777   …   0.416223   -0.908852    -0.339937 
  2.53162     6.72032    1.25924       0.950754   -0.53449      0.0961831
  2.63868     2.01813    2.13187       1.72483     0.513846    -0.0914956
 -0.631409   -0.94409    1.88523       0.440639   -0.745012     0.0996248
 -0.8162      0.174124   3.59941       1.30466    -2.34368      0.369603 
 -0.274186   -1.29081   -1.33486   …  -0.473495   -0.62421     -1.26462  
  0.401881   -0.738539   1.34847       0.39025    -0.00439161   0.0151878
 21.7269     -4.06805   -9.91123      -2.0272     -1.41633     -0.0263997
 -2.58371    -0.156413  -0.737911     -0.265203    1.58469      0.322162 
  4.17899    17.0184     1.83096       2.91487    -0.433558     1.164    
  0.772371    0.155414   1.66253   …  -0.833175    0.397829    -0.700775 
 -5.42643    -0.788577   1.64396      -1.44325     0.387156     0.840406 
  0.500175   -4.16445   -3.51728       0.221502    0.00109187  -1.79048  
  ⋮                                ⋱                                     
  0.602363   -1.58106    1.88832      -0.291308   -0.863038    -1.15118  
  0.539476   -1.8133     2.05333   …  -0.31821    -1.60131     -1.2452   
  0.485803   -1.87352    2.20944      -0.164564   -0.654407    -1.2894   
  0.475569   -1.44394    2.10787      -0.344536   -1.242       -1.30914  
  5.56309    -2.43271    4.40758      -2.06542     0.563161    -1.10901  
  5.82127     1.07777    2.81812      -2.12045    -0.50111     -0.404985 
  4.46274    -2.24485    4.53101   …  -1.62547    -0.859139    -0.507143 
  4.32701     1.4561     3.72729      -1.62872    -0.24935     -0.80077  
  3.20833    -3.0066     3.52118      -1.18471    -0.67259     -0.423825 
  3.35449    -3.69571    2.98001      -1.24199     0.215023    -0.144022 
 -0.0676899  -1.15867    2.49808      -0.213136   -1.55336     -0.767101 
  0.431614   -1.23256    2.16338   …   0.0266245  -1.6841      -0.930856 

In [29]:
# Find k clusters
using Clustering
out=kmeans(T',k)


Out[29]:
KmeansResult{Array{Float64,2},Float64,Int64}([-6.75249 -0.662165 … -1.1111 0.0884205; -0.592097 0.354341 … 0.121346 -2.65327; … ; 0.223095 0.110968 … 0.655065 0.422075; 0.0910122 -0.0124653 … 0.315076 -0.348167], [11, 10, 11, 2, 2, 17, 2, 3, 2, 5  …  17, 17, 9, 9, 9, 9, 9, 9, 2, 17], [16.837, 36.2713, 25.6976, 15.6004, 26.6287, 26.4335, 10.5758, 294.958, 21.645, 145.355  …  45.5321, 65.0738, 71.2243, 105.03, 94.0346, 57.3524, 58.6991, 64.8662, 55.6902, 52.3087], [908, 2867, 575, 1846, 423, 174, 361, 451, 430, 1330, 1884, 67, 109, 1760, 1353, 696, 1444, 80, 654, 404], [908, 2867, 575, 1846, 423, 174, 361, 451, 430, 1330, 1884, 67, 109, 1760, 1353, 696, 1444, 80, 654, 404], 877003.9289306196, 80, true)

In [30]:
# Plot points defined by the first three principal components
function myPlot(C::Vector, T::Array, k::Int)
    P=Array{Any}(undef,k)
    for j=1:k
        P[j]=T[C.==j,1:3]
        plot3D(P[j][:,1],P[j][:,2],P[j][:,3],"x")
    end
end


Out[30]:
myPlot (generic function with 1 method)

In [31]:
myPlot(out.assignments,T,k)



In [32]:
# Data set D2
X=readdlm("files/d2fn.txt")


Out[32]:
3036×40 Array{Float64,2}:
  0.0681   0.072    0.3944  -0.0606  …  -0.3723  -0.2963  -0.619   -0.4171
  0.1402   0.2517   0.0843   0.1794     -0.1204   0.0086  -0.8177  -0.3646
  0.0081   0.3179   0.5802   0.0162     -0.2968  -0.3981  -1.055   -0.4583
  0.1049   0.3031   0.3048   0.2214     -0.288   -0.0841  -0.7231  -0.2951
  0.0609   0.4262   0.104    0.227      -0.3255  -0.1652  -0.6634  -0.4358
 -0.1163   0.2005  -0.2816   0.1192  …  -0.2508  -0.034   -0.5112  -0.1964
  0.1276   0.4596   0.3751   0.2379     -0.4025  -0.3775  -0.6091  -0.4265
  0.1362   0.4215   0.0645   0.3017     -0.3727  -0.3352  -0.642   -0.4601
  0.1455   0.3676   0.1463   0.2997     -0.4893  -0.5417  -0.6408  -0.5108
 -0.7093   0.6045  -0.6372  -0.6594     -0.1644  -0.0441   0.4404   0.4663
 -0.0044   0.1012   0.1362   0.432   …  -0.1594   0.4027  -0.5591  -0.1365
  0.1288   0.6568   0.7199   0.2173     -0.2006   0.0184  -1.1106   0.1015
  0.5014   0.6976   0.9516   0.6017     -1.168   -2.0287  -1.0887  -0.5792
  ⋮                                  ⋱                                    
  0.2222   0.0015   0.0284  -0.3749      0.2817  -0.4384   0.2771   0.7985
  0.2587   0.1443  -0.1444   0.4272  …  -0.4435  -0.6569  -0.6596  -0.4899
  0.3712   0.3939   0.4392   0.4054     -0.7037  -0.9145  -0.8822  -0.7753
  0.3355   1.0494   1.1719   0.5783     -0.7814  -1.0491  -0.7743   0.041 
  0.3646   0.5563   0.375    0.3172     -0.5824  -0.7064  -0.8294  -0.8036
  0.4069   0.7489   0.1103   0.0303     -0.4834  -0.6479  -0.7976  -0.431 
 -0.5085   0.5377   0.6047  -0.5306  …  -0.2162  -1.8589   0.2188   0.2223
  0.2777  -0.403   -0.0952   0.2562     -0.303    0.5055  -0.8516   0.2553
 -0.0092  -0.1585   0.156   -0.6566      0.1016  -0.2363  -0.0053  -0.3698
  0.1618   0.0412  -0.3044   0.0847      0.5937  -0.0567   0.9136  -0.5945
  0.3653  -0.1373   0.0607   0.2313      0.3029   0.6607  -0.2687  -0.7253
  0.1646  -0.3546  -0.2455  -0.0452  …  -0.2054   0.2651  -0.3522  -0.7788

In [33]:
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)


Out[33]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000001D84A90>

In [34]:
k=4
T=myPCA(X,k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)


We read the data with the package CSV.jl, replacing the string "NULL" with missing.


In [35]:
using CSV

In [36]:
sf = CSV.read("files/d3efn.txt", header=0,missingstring="NULL")


Out[36]:

10,387 rows × 103 columns (omitted printing of 95 columns)
All columns are of type Float64⍰ (Union{Missing, Float64}); entries read as "NULL" appear as missing. (Tabular output abbreviated.)

In [37]:
# sf[:,103]

In [38]:
typeof(sf)


Out[38]:
DataFrames.DataFrame

In [39]:
# Set missing values to zero
X=coalesce.(Matrix(sf[:,1:end-1]),0.0)


Out[39]:
10387×102 Array{Float64,2}:
  0.179  -0.201   1.289   0.039   0.869  …  -0.961  -0.821  -0.551   1.279
  1.042   0.162   0.612   0.472   0.382     -0.788  -0.598  -0.078   0.0  
 -0.022  -0.482   0.0    -0.182   0.0        0.688   0.758  -0.002   0.108
 -0.597   0.183   0.593  -0.467   0.273      0.253  -0.187  -0.707   0.393
  0.699   0.639   0.739   0.349   0.0       -0.711  -3.391   0.769   1.519
  1.035  -3.015  -2.255  -2.725  -2.095  …  -1.235  -3.185  -0.895   1.845
  0.64   -4.9    -4.24   -4.34   -3.39      -1.04   -1.72   -1.41    2.62 
  0.904  -2.636  -2.446  -1.996   0.0       -0.986  -0.986  -0.976   1.114
  0.74   -2.15    0.0    -3.43    0.0       -1.17   -0.95   -1.78    1.87 
  0.771   0.0    -1.389  -1.249   0.0       -1.829   0.0    -1.049   0.391
  0.351  -0.829  -1.239  -0.679  -1.089  …  -1.139  -0.279  -0.789   0.441
  0.926  -1.034   0.0    -0.654   0.0        0.726   0.166  -0.104   0.0  
 -0.216  -0.126   0.484   0.024   0.134     -1.206   0.074  -0.226   1.434
  ⋮                                      ⋱                   ⋮            
 -0.995  -0.945   0.0     0.0    -0.145  …   0.205   0.0     0.295   0.0  
 -2.323  -2.043  -1.213  -2.633  -0.993      0.857  -0.743  -1.683   0.0  
  1.511   0.681   1.601   0.961   2.641      1.141   0.401  -1.119   0.521
  1.007   0.197   1.057   0.547   1.657      1.287  -0.043  -0.213   0.547
  0.968   0.688   1.278   0.838   1.038      2.228  -0.102  -1.342  -1.232
  1.346   1.396   2.616   1.506   2.366  …  -1.834   0.186  -0.004  -5.214
 -0.186   1.114   0.814   0.274   0.0        0.614  -0.266   0.204  -1.236
 -0.985  -1.715  -1.405  -1.605  -0.425      0.855   6.205  -0.155  -1.565
 -0.135  -1.315   0.225  -2.025   0.095      0.275   0.055  -0.605  -1.735
  0.095  -0.845  -0.895   0.0    -0.445      0.885   0.455   0.255  -0.695
  0.173  -0.057   0.0     0.243   0.0    …   0.263   0.143   0.103   0.163
  0.826  -0.054  -1.134   0.436  -0.444     -0.284  -0.244  -0.164  -0.164

In [40]:
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)


Out[40]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x00000000523813C8>

In [41]:
k=20
T=myPCA(X,k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)



In [43]:
# Clustering the transpose (clusters the samples instead of the genes)
k=20
T=myPCA(Matrix(X'),k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)


